Measurement models and Confirmatory Factor Analysis

Structural Equation Modeling

Tommaso Feraco


Outline

  • Intro
  • CFA
  • Identification and fit
  • CFA and validity - R
  • Exercise 3.2
  • A neglected implication
  • Readings

Factor analysis

  • Factor analysis is a statistical technique widely used in the social sciences.
  • It describes variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.
  • In its original definition (Spearman, 1904), the relationship between observed and latent variables is not defined a priori (Exploratory Factor Analysis, EFA).
  • In the SEM framework, the researcher first develops a hypothesis about which factors s/he believes underlie the measures s/he has used, and may impose constraints on the model based on these a priori hypotheses (Confirmatory Factor Analysis, CFA).

Exploratory Factor Analysis (EFA)

You can run an exploratory factor analysis in R using the built-in function factanal() (you can also do it in lavaan with efa()).

  • The number of latent factors is not predetermined
  • All latent variables are free to influence all the observed variables
  • Measurement errors are not allowed to correlate

Is this a good way to find latent variables? YOUR OPINION?
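As a minimal, self-contained sketch of factanal() (the six-item data set below is simulated purely for illustration; nothing from the slides is assumed):

```r
# Simulate two correlated latent factors generating six observed items
set.seed(1)
n <- 500
f1 <- rnorm(n)
f2 <- 0.5 * f1 + sqrt(1 - 0.25) * rnorm(n)
d <- data.frame(
  x1 = 0.8 * f1 + rnorm(n, sd = 0.6),
  x2 = 0.7 * f1 + rnorm(n, sd = 0.7),
  x3 = 0.6 * f1 + rnorm(n, sd = 0.8),
  x4 = 0.8 * f2 + rnorm(n, sd = 0.6),
  x5 = 0.7 * f2 + rnorm(n, sd = 0.7),
  x6 = 0.6 * f2 + rnorm(n, sd = 0.8)
)

# factanal() is in the built-in stats package; only the number of factors
# is specified, not which item belongs to which factor
efa <- factanal(d, factors = 2, rotation = "promax")

# All items load on all factors; the cutoff only hides small loadings
print(efa$loadings, cutoff = 0.3)
```

Note that, as the bullets above say, every item gets a loading on every factor: the structure is discovered, not imposed.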

Confirmatory Factor Analysis (CFA)

You can run a CFA in R using the lavaan function cfa().

  • We specify the model and the latent variables a priori
  • Latent variables only affect predefined observed variables
  • Measurement errors may correlate

Is this a good way to find latent variables? YOUR OPINION?

General formula

  • The general model for confirmatory factor analysis can be written as:

\[ \begin{aligned} \mathbf{x} &= \mathbf{\Lambda}_x\,\mathbf{\xi} + \mathbf{\delta} \\ \mathbf{y} &= \mathbf{\Lambda}_y\,\mathbf{\eta} + \mathbf{\epsilon} \end{aligned} \]

where \(\mathbf{x}\) and \(\mathbf{y}\) are observed variables, \(\mathbf{\xi}\) and \(\mathbf{\eta}\) are latent factors, and \(\mathbf{\delta}\) and \(\mathbf{\epsilon}\) are errors of measurement.

  • The coefficients in \(\mathbf{\Lambda}_x\) and \(\mathbf{\Lambda}_y\) describe the effects of the latent variables on the observed variables.

The general formula explained

\[ \begin{aligned} y &= b_0 + b_1 x + \epsilon \\ y_1 &= \tau_1 + \lambda_1\eta + \epsilon_1 \end{aligned} \]

\[ \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} \tau_1 \\ \tau_2 \\ \tau_3 \end{bmatrix} + \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \lambda_3 \end{bmatrix} (\eta_1) + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix} \]

\[ \begin{aligned} y_1 &= \tau_1 + \lambda_1\eta_1 + \epsilon_1 \\ y_2 &= \tau_2 + \lambda_2\eta_1 + \epsilon_2 \\ y_3 &= \tau_3 + \lambda_3\eta_1 + \epsilon_3 \end{aligned} \]

Reflective variables in a realist definition

When we talk about an effect \(\mathbf{\Lambda}\) (\(\Rightarrow\)) of a latent variable on an observed variable (\(\mathbf{x}\)), we are basing our model on a realist framework of reflective latent variables. In other words, we assume that:

  • ARROWS are ARROWS: it is the latent construct that affects the observed responses
  • A realist interpretation is needed: the latent variable is something that really exists!
  • Observations are things that are really affected/produced by the latent variable + some error

Pragmatic interpretations of latent variables are of no help: “a factor model is just a good way of summarizing a large number of items”.

Reflective variables in a realist definition

REMEMBER: every statistical method applied to psychology has theoretical implications that do not only concern the results obtained. The selected method/model has implications of its own, and CFA is no exception!

If you do not want to assume any realist position or reflective assumptions on your latent variables, you should adopt other methods of data reduction:

  • PCA
  • EGA

One-factor model

Two-factor model with correlated variables

Two-factor model with orthogonal variables

Hierarchical model

In R

m <- "
latent1 =~ x1 + x2 + x3
latent2 =~ x4 + x5 + x6
"
# d is assumed to be a data set containing the variables x1-x6
fit <- lavaan::cfa(m, data = d)
semPlot::semPaths(fit)

  • The first latent variable explains items 1, 2, and 3
  • The second latent variable explains items 4, 5, and 6
  • The diagram represents the hypothesized model

Matrices

Lambda: matrix of loadings

Phi: latent variance-covariance matrix

Theta: error (residual) variance-covariance matrix of the observed variables

Constraints

In order to estimate the parameters in structural equation models with latent variables, you must set some identification constraints. Otherwise, you won’t be able to estimate the parameters (the model is not identified).

We can choose one of the two following strategies:

  • To standardize latent variables such that factor means are fixed to 0 and factor variances are fixed to 1.
  • To set to one a loading (\(\lambda\)) for each latent variable.

In R, the functions sem() and cfa() use the second strategy (the marker method) by default. To switch to the first, set the std.lv argument to TRUE.

fit <- sem(m, std.lv = TRUE, ...)

Constraints explained

\[ \Sigma(\eta) = \Lambda\Psi\Lambda' + \Theta_{\epsilon} \]

Marker method

\[ \Sigma(\eta) = \psi_{11} \begin{bmatrix} 1 \\ \lambda_2 \\ \lambda_3 \end{bmatrix} (1\,\lambda_2\,\lambda_3) + \begin{bmatrix} \theta_{11} & 0 & 0 \\ 0 & \theta_{22} & 0 \\ 0 & 0 & \theta_{33} \end{bmatrix} \]

Standardization

\[ \Sigma(\eta) = (1) \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \lambda_3 \end{bmatrix} (\lambda_1\,\lambda_2\,\lambda_3) + \begin{bmatrix} \theta_{11} & 0 & 0 \\ 0 & \theta_{22} & 0 \\ 0 & 0 & \theta_{33} \end{bmatrix} \]
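A quick numeric check, in base R, that the two constraint strategies imply the same covariance matrix. The loadings here (0.8, 0.7, 0.6) are hypothetical:

```r
# Standardized parameterization: factor variance psi fixed to 1, all loadings free
lambda_std <- c(0.8, 0.7, 0.6)
theta <- diag(c(0.36, 0.51, 0.64))           # residual (error) variances
Sigma_std <- 1 * lambda_std %*% t(lambda_std) + theta

# Marker parameterization of the SAME model: first loading fixed to 1, psi free
lambda_mark <- lambda_std / lambda_std[1]    # rescaled loadings: 1, 0.875, 0.75
psi <- lambda_std[1]^2                       # factor variance absorbs the scale: 0.64
Sigma_mark <- psi * lambda_mark %*% t(lambda_mark) + theta

all.equal(Sigma_std, Sigma_mark)             # TRUE: identical implied covariances
```

The two parameterizations are just different ways of fixing the scale of the latent variable; the model-implied covariance matrix, and therefore the fit, is the same.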

Identification rules

If you remember, we talked about identification in the Introduction. For CFA models, the following identification rules can be followed:

  • the t-rule
  • the Three-Indicator Rules
  • the Two-Indicator Rules

the t-rule

We have already seen it. This is a necessary but not sufficient condition:

\[ t \leq \frac{q(q+1)}{2} \]

where \(t\) is the number of free parameters and \(q\) the number of observed variables.

In this case:

  • The number of free parameters (\(t\)) must be less or equal to the number of nonredundant elements in the covariance matrix of the observed variables

In other words: the number of nonredundant elements in \(\mathbf{S}\) is the maximum number of possible equations; if the number of unknowns exceeds the number of equations, the identification of \(\mathbf{\theta}\) is not possible.
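A minimal worked count, assuming a one-factor model with three indicators scaled via the marker method:

```r
q <- 3                      # observed variables
# Free parameters: 2 loadings (one is fixed to 1), 1 factor variance,
# and 3 residual variances
t_free <- 2 + 1 + 3
max_params <- q * (q + 1) / 2   # nonredundant elements of S: 6
t_free <= max_params            # TRUE, with equality: the model is just-identified
```

With t = 6 equations and 6 unknowns the model has zero degrees of freedom: identified, but with no room left to test the fit.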

the Three-Indicator Rules

The three-indicator rule is a sufficient but not necessary condition. It places no restrictions on \(\mathbf{\Phi}\) (the variance-covariance matrix of the exogenous latent variables).

  1. A sufficient condition to identify a one-factor model is to have at least three indicators with nonzero loadings (\(\lambda\)) and \(\mathbf{\Theta}\) diagonal.

  2. A multifactor model is identified when:

    1. It has three or more indicators per latent variable.
    2. Each row of \(\mathbf{\Lambda}\) has one and only one nonzero element.
    3. \(\mathbf{\Theta}\) is diagonal.

the Two-Indicator Rules

The two-indicator rule is a sufficient but not necessary condition for models with more than one \(\mathbf{\xi}\).

  • \(\mathbf{\Theta}\) is diagonal

  • Each latent variable is scaled (one \(\lambda_{ij}\) set to 1 for each \(\mathbf{\xi}\)).

  • It requires the following conditions:

    1. There are at least two indicators per latent variable
    2. Each row of \(\mathbf{\Lambda}\) has one and only one nonzero element
    3. \(\mathbf{\Theta}\) is diagonal
    4. Each row of \(\mathbf{\Phi}\) has at least one nonzero off-diagonal element

Model fit

As before, we can evaluate model fit of a CFA using:

  • \(\chi^2\) test
  • Absolute fit indices
inspect(fit, "fit")[c("gfi", "agfi")]
  • Absolute fit indices based on residuals
inspect(fit, "fit")[c("srmr", "rmsea")]
  • Incremental fit indices
inspect(fit, "fit")[c("cfi", "nnfi")]
  • Information criterion based indices
inspect(fit, "fit")[c("aic", "bic")]
  • \(R^2\) and the total coefficient of determination
inspect(fit, "rsquare")

The total coefficient of determination

While \(R^2\) gives the proportion of explained variance in each single dependent variable

inspect(fit, "rsquare")
  x1   x2   x3   x4   x5   x6 
0.27 0.46 0.32 0.31 0.40 0.29 

… the total coefficient of determination represents the proportion of variance in the dependent variables that is explained by all the variables in the model, both directly and indirectly.
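In symbols, this quantity (the total coefficient of determination, TCD) can be written as:

\[ \mathrm{TCD} = 1 - \frac{\det(\hat{\mathbf{\Theta}})}{\det(\hat{\mathbf{\Sigma}})} \]

where \(\hat{\mathbf{\Theta}}\) is the estimated error variance-covariance matrix and \(\hat{\mathbf{\Sigma}}\) is the model-implied covariance matrix of the observed variables, which is exactly what the R chunk below computes.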

TH <- inspect(fit, "estimates")$theta
S <- fitted(fit)$cov
1 - det(TH) / det(S)
[1] 0.85

Introduction to CFA and validity

Whenever we estimate a latent variable, we are MEASURING a latent trait that explains observed variables (or other latent factors).

In other words, CFA is a tool that is used to measure constructs that are not directly observable (remember the realist framework).

Step 1: the construct

… this is not the topic of this course but, after theoretical reflection, answer at least these questions, which will guide your modeling, and draw the model:

  • is it unidimensional?
  • is it multidimensional?
  • are the factors correlated?
  • is it hierarchical?
  • does it have a bifactor structure?

All these answers have statistical and theoretical consequences / assumptions.

Step 2: items and scale construction

Assuming that the construct exists, you need

  • an explicit, precise definition of the measured attribute or construct
  • a set of items sensitive to variations of the measured attribute or construct

In fact, we assume that the observations (item responses) should change according to modifications of the latent trait.

Items should (possibly) cover all the aspects of the construct.

To help your work, there are tools that can be used:

  • Spoto et al., 2023 https://doi.org/10.1037/met0000545

This, of course, applies if the questionnaire/test is new (or if you want to develop a new version).

Step 3: collect the data!

Data collection is not obvious and follows your previous decisions. We might plan:

  • 2 data collections (cfa measurement + nomological network)
  • 3 data collections (efa + cfa + nomological network)
  • focus groups + pilot on item comprehension + […]
  • […]
  • […]

Step 4: data analysis (CFA only)

Of all the possible options, we will only focus on the CFA.

Imagine we have collected data for 862 participants using the WISC-IV, one of the most famous intelligence tests. It comprises 15 subtests; here we use the 10 core subtests, grouped under four indices:

  • VCI: verbal comprehension index
  • SI: Similarities
  • VC: Vocabulary
  • CO: Comprehension
  • WMI: working memory index
  • DS: Digit span
  • LN: Letter-Number seq.
  • PRI: perceptual reasoning index
  • BD: Block design
  • PCn: Picture concepts
  • MR: Matrix reasoning
  • PSI: processing speed index
  • CD: Coding
  • SS: Symbol search

The subtests are assumed to belong to specific abilities (the four indices above), which are in turn influenced by a general factor: we have a hierarchical structure.

Open the data

Oops, the data are not in standard (raw) form — we have a correlation matrix!

# Exercise 3.1
load("../data/Exercise3_1.Rdata")
# view(d)
BD SI DS PCn CD VC LN MR CO SS
1.00 0.38 0.26 0.34 0.25 0.33 0.29 0.42 0.27 0.30
0.38 1.00 0.35 0.43 0.14 0.62 0.35 0.41 0.51 0.27
0.26 0.35 1.00 0.28 0.15 0.33 0.42 0.29 0.24 0.20
0.34 0.43 0.28 1.00 0.11 0.41 0.35 0.43 0.35 0.24
0.25 0.14 0.15 0.11 1.00 0.13 0.19 0.20 0.15 0.46
0.33 0.62 0.33 0.41 0.13 1.00 0.38 0.40 0.59 0.24
0.29 0.35 0.42 0.35 0.19 0.38 1.00 0.35 0.30 0.24
0.42 0.41 0.29 0.43 0.20 0.40 0.35 1.00 0.30 0.26
0.27 0.51 0.24 0.35 0.15 0.59 0.30 0.30 1.00 0.22
0.30 0.27 0.20 0.24 0.46 0.24 0.24 0.26 0.22 1.00

The theoretical model

Intelligence theory supposes that test scores are affected by specific abilities (e.g., processing speed), which are in turn directly influenced by an overarching latent factor (g).

Try to write the model

Specify and fit the model

We are skipping some steps (testing the validity of the single tests and of the first-order abilities) … in your own work you cannot skip them!

m <- "
VCI=~SI+VC+CO
PRI=~BD+PCn+MR
WMI=~DS+LN
PSI=~CD+SS
g=~VCI+PRI+WMI+PSI
"
fit <- sem(m, std.lv = TRUE, sample.cov = d, sample.nobs = N)

Model parameters

parameterestimates(fit, standardized = TRUE)[1:14, 1:11]
   lhs op rhs  est    se    z pvalue ci.lower ci.upper std.lv std.all
1  VCI =~  SI 0.46 0.032 14.4  0.000    0.394     0.52   0.77    0.77
2  VCI =~  VC 0.48 0.033 14.5  0.000    0.419     0.55   0.82    0.82
3  VCI =~  CO 0.41 0.030 13.6  0.000    0.348     0.46   0.68    0.69
4  PRI =~  BD 0.19 0.052  3.6  0.000    0.087     0.29   0.58    0.59
5  PRI =~ PCn 0.21 0.057  3.6  0.000    0.095     0.32   0.64    0.64
6  PRI =~  MR 0.22 0.060  3.6  0.000    0.100     0.33   0.67    0.67
7  WMI =~  DS 0.36 0.038  9.5  0.000    0.284     0.43   0.60    0.60
8  WMI =~  LN 0.42 0.045  9.2  0.000    0.327     0.50   0.70    0.70
9  PSI =~  CD 0.48 0.037 12.9  0.000    0.405     0.55   0.55    0.55
10 PSI =~  SS 0.71 0.059 12.1  0.000    0.599     0.83   0.83    0.83
11   g =~ VCI 1.36 0.129 10.5  0.000    1.107     1.61   0.80    0.80
12   g =~ PRI 2.92 0.865  3.4  0.001    1.220     4.61   0.95    0.95
13   g =~ WMI 1.35 0.171  7.9  0.000    1.013     1.68   0.80    0.80
14   g =~ PSI 0.59 0.067  8.8  0.000    0.457     0.72   0.51    0.51
# [...]

Model parameters

# [...]
parameterestimates(fit, standardized = TRUE)[15:20, 1:11]
   lhs op rhs  est    se  z pvalue ci.lower ci.upper std.lv std.all
15  SI ~~  SI 0.40 0.028 14      0     0.35     0.46   0.40    0.41
16  VC ~~  VC 0.33 0.027 12      0     0.28     0.38   0.33    0.33
17  CO ~~  CO 0.53 0.031 17      0     0.47     0.59   0.53    0.53
18  BD ~~  BD 0.66 0.037 18      0     0.58     0.73   0.66    0.66
19 PCn ~~ PCn 0.59 0.036 16      0     0.52     0.66   0.59    0.59
20  MR ~~  MR 0.55 0.035 16      0     0.48     0.62   0.55    0.55

Model fit

fi <- c("cfi", "tli", "nnfi", "agfi", "srmr", "rmsea")
round(inspect(fit, "fit")[fi], 3)
  cfi   tli  nnfi  agfi  srmr rmsea 
0.985 0.978 0.978 0.973 0.028 0.037 

Reliability

semTools::reliability(fit)
        VCI  PRI  WMI  PSI
alpha  0.80 0.66 0.59 0.63
omega  0.80 0.67 0.59 0.66
omega2 0.80 0.67 0.59 0.66
omega3 0.80 0.67 0.59 0.66
avevar 0.58 0.40 0.42 0.50
semTools::reliabilityL2(fit, secondFactor = "g")
       omegaL1        omegaL2 partialOmegaL1 
          0.75           0.91           0.85 

Graphical representation

semPlot::semPaths(
  fit,
  edge.label.cex = .8,
  what = "std",
  sizeMan = 7,
  sizeLat = 7,
  edge.color = "black",
  edge.label.color = "black"
)

A second theoretical model

Parallel theories of intelligence suppose that test scores are affected by a general factor (g) AND by specific abilities that explain the remaining variance. Both types of factors directly influence the observed scores.

All factors are set to be orthogonal!

Bifactor model in R

# Bifactor model
mb <- "
VCI=~a*SI+a*VC+a*CO
PRI=~b*BD+b*PCn+b*MR
WMI=~c*DS+c*LN
PSI=~d*CD+d*SS
g=~SI+VC+CO+BD+PCn+MR+DS+LN+CD+SS
"

fitb <- sem(
  mb,
  orthogonal = TRUE,
  std.lv = TRUE,
  sample.cov = d,
  sample.nobs = N
)

This model fits the data about as well as the previous one.

COMMENTS? QUESTIONS?

Exercise 3.2

# 2. Exercise 3.2 - working with real data and Likert scales
# The dataset contains data collected from 1083 students on one questionnaire
# The questionnaire aims to measure adaptability with 9 items on a 7-point scale
# The first column is just the student's id
D.ad <- read.csv("../data/Exercise3_2.csv")

# We want to test the factorial validity of the Italian questionnaire
# Martin et al., 2012 hypothesize three subscales:
# (behavior [1:3], cognition [4:6], and emotion[7:9])
# But found 1 or 2 factors in an EFA:
# (cognitive-behavioral [1:6] AND affective [7:9])
# Later, they tested these models with a CFA
# Test the two models, compare them and make your decisions
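A possible starting point for the exercise — only the model syntax is sketched here, and the item names (item1–item9) are an assumption; check names(D.ad) for the real ones:

```r
# Three-factor model hypothesized by Martin et al. (2012)
m3 <- "
behavior  =~ item1 + item2 + item3
cognition =~ item4 + item5 + item6
emotion   =~ item7 + item8 + item9
"

# Two-factor model suggested by their EFA
m2 <- "
cogbeh    =~ item1 + item2 + item3 + item4 + item5 + item6
affective =~ item7 + item8 + item9
"

# Fitting and comparison (requires the real data and item names):
# fit3 <- lavaan::cfa(m3, data = D.ad[, -1])
# fit2 <- lavaan::cfa(m2, data = D.ad[, -1])
# compare fit indices and AIC/BIC, e.g. inspect(fit3, "fit")[c("cfi", "rmsea", "aic", "bic")]
```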

Predictions with latent variables

This will directly bring us to the next set of slides, but some questions before:

  • Can we use a latent variable to ‘predict’ another variable?
  • How (in R)?
  • After we confirm that a latent variable ‘exists’, can we use manifest variables as predictors?
  • Can we use residuals as predictors?

LET'S SIMULATE
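A minimal base-R simulation along these lines (all coefficients and the sample size are hypothetical). It shows why regressing an outcome on a fallible sum score attenuates the effect of the latent predictor, which is one argument for keeping the predictor latent in an SEM:

```r
set.seed(42)
n <- 2000
eta <- rnorm(n)                              # the "true" latent predictor
# Three noisy indicators of eta, with loadings 0.8, 0.7, 0.6
x <- sapply(c(0.8, 0.7, 0.6),
            function(l) l * eta + rnorm(n, sd = sqrt(1 - l^2)))
y <- 0.5 * eta + rnorm(n, sd = sqrt(1 - 0.25))   # true effect of eta on y: 0.5

b_true <- coef(lm(y ~ eta))[2]                   # oracle: regression on eta itself
b_sum  <- coef(lm(y ~ scale(rowMeans(x))))[2]    # what we can do with a sum score
c(b_true = unname(b_true), b_sum = unname(b_sum))
```

The sum-score slope comes out smaller than the true effect because the score mixes the latent trait with measurement error; a latent-predictor SEM (e.g., lavaan's sem()) would estimate the disattenuated effect instead.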

Suggested readings

  • Best practices for scale development: https://doi.org/10.3389/fpubh.2018.00149
  • Content validity: https://doi.org/10.1037/met0000545
  • (as always) Latent Variable Modeling Using R: A Step-by-Step Guide (Beaujean, 2014)

References